feat(ci): add priority label to preempt runners for high-priority sweeps #1726

Open: cquil11 wants to merge 1 commit into main from cjq/priority-label

cquil11 (Collaborator) commented Jun 12, 2026

Adds a priority modifier label for run-sweep PRs (combine it with any sweep label), per the Slack discussion. It is reserved for day-0-style launches that need the fleet immediately.

What it does

  • Guard: resolves who added the label from the issue timeline; if they're not a SemiAnalysisAI org member, the label is removed and the run fails.
  • Preempt (in setup): derives target runner labels from the PR's own search space, cancels every other in-progress or queued run with jobs on those runners, waits for them to release, and uploads the victim list as the preempted-runs artifact. Matching goes through each runner's full label set (utils/preempt_runners.py), so e.g. a b200 priority sweep preempts a b200-multinode job occupying a shared node.
  • Restore (restore-preempted, if: always()): re-runs each victim's failed jobs with gh run rerun --failed once the priority sweep finishes.
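
The guard in the first bullet can be sketched as a pure function over already-fetched timeline events. This is a minimal illustration, not the workflow's actual implementation: it assumes the events come from the issue timeline API oldest-first, with the `event`, `label.name`, and `actor.login` fields of `labeled` events, and it stands in for the org membership lookup with a plain set. The function names and the sample logins are hypothetical.

```python
from __future__ import annotations


def label_adder(timeline: list[dict], label: str) -> str | None:
    """Return the login of whoever most recently added `label`, else None."""
    adder = None
    for event in timeline:  # assumed to be ordered oldest-first
        if event.get("event") != "labeled":
            continue
        if (event.get("label") or {}).get("name") != label:
            continue
        adder = (event.get("actor") or {}).get("login")
    return adder


def guard(timeline: list[dict], label: str, org_members: set[str]) -> tuple[bool, str | None]:
    """Allow the priority path only if the label's adder is an org member."""
    who = label_adder(timeline, label)
    return (who is not None and who in org_members), who


# Hypothetical timeline: a member adds a sweep label, a non-member adds priority.
timeline = [
    {"event": "labeled", "label": {"name": "some-sweep-label"}, "actor": {"login": "alice"}},
    {"event": "commented", "actor": {"login": "alice"}},
    {"event": "labeled", "label": {"name": "priority"}, "actor": {"login": "mallory"}},
]
print(guard(timeline, "priority", {"alice"}))  # (False, 'mallory')
```

When the first element of the result is false, the workflow described above would remove the label and fail the run.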

When it's useful

  • Day-0 model launches and other drop-everything benchmarks that shouldn't queue behind routine sweeps.
  • The preempted work isn't lost — it's re-queued automatically when the priority run completes.

Caveats

  • Preemption is run-granular (GitHub has no per-job cancel API): victim runs also lose in-flight jobs on unrelated SKUs; those get re-run too, but from scratch.
  • If the priority run is cancelled before restore-preempted fires, restore manually from the preempted-runs artifact.
  • rerun --failed on cancelled jobs has not yet been exercised against a real preemption and should be verified once in practice; the fallback is per-job POST /actions/jobs/{id}/rerun.
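
For the manual-restore caveat above, a small helper can turn the downloaded artifact into the commands to run. This sketch assumes only what the workflow snippet in this PR shows: the artifact contains `preempted_runs.json`, a JSON array of objects with a `run_id` field. The function name and the `OWNER/REPO` placeholder are illustrative.

```python
import json
import shlex


def rerun_commands(artifact_json: str, repo: str) -> list[str]:
    """Build one `gh run rerun <id> --failed` command per recorded victim run."""
    victims = json.loads(artifact_json)
    return [
        shlex.join(["gh", "run", "rerun", str(v["run_id"]), "--failed", "--repo", repo])
        for v in victims
    ]


# Hypothetical artifact contents; real entries may carry more fields than run_id.
sample = '[{"run_id": 101}, {"run_id": 103}]'
for cmd in rerun_commands(sample, "OWNER/REPO"):
    print(cmd)
```

Printing the commands instead of executing them keeps the operator in the loop, which seems appropriate for a path that is only taken when the automated restore did not fire.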

Selection logic is covered by utils/test_preempt_runners.py (11 tests). The priority label has been created in the repo.


Note

Medium Risk
Cancels arbitrary in-flight CI runs across the repo when misused; mitigated by org-only label auth and documented restore path, but collateral run cancellation is inherent.

Overview
Adds a priority modifier label for PR sweeps (used together with an existing sweep label). run-sweep.yml treats priority like other sweep labels for concurrency and re-triggers. In setup, it verifies the label was added by an org member (otherwise it removes the label and fails), then cancels other in-progress/queued runs whose jobs compete for the sweep’s runner labels via the new utils/preempt_runners.py and uploads the victim run IDs as preempted-runs. After the sweep, restore-preempted re-queues the victims with gh run rerun --failed.

Preemption is whole-run (GitHub has no per-job cancel); matching uses each runner’s full label set so shared nodes (e.g. b200 vs b200-multinode) are handled correctly. AGENTS.md documents usage and caveats; utils/test_preempt_runners.py covers selection logic.
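
To make the label-set matching concrete, here is a minimal sketch of how victim runs could be selected. It is not the code in utils/preempt_runners.py: the `Job` type, the function names, the runner names, and the rule for queued jobs (a queued job counts if some contested runner satisfies all of its requested labels) are assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    run_id: int
    labels: frozenset[str]          # labels the job requested via runs-on
    runner_name: str | None = None  # set once a runner has picked the job up


def contested_runners(runners: dict[str, frozenset[str]], targets: set[str]) -> set[str]:
    """Runners whose full label set includes at least one target label."""
    return {name for name, labels in runners.items() if labels & targets}


def select_victims(jobs, runners, targets, own_run_id):
    """Return the sorted IDs of other runs with a job on, or bound for, a contested runner."""
    contested = contested_runners(runners, targets)
    victims = set()
    for job in jobs:
        if job.run_id == own_run_id:
            continue  # never preempt the priority sweep itself
        if job.runner_name is not None:
            hit = job.runner_name in contested
        else:
            # Queued: it could land on any contested runner that has all its labels.
            hit = any(job.labels <= runners[name] for name in contested)
        if hit:
            victims.add(job.run_id)  # preemption is whole-run, so record the run
    return sorted(victims)


runners = {
    "node-1": frozenset({"self-hosted", "b200", "b200-multinode"}),  # shared node
    "node-2": frozenset({"self-hosted", "h100"}),
}
jobs = [
    Job(100, frozenset({"b200"}), "node-1"),            # the priority sweep's own job
    Job(101, frozenset({"b200-multinode"}), "node-1"),  # occupies the shared node
    Job(102, frozenset({"h100"}), "node-2"),            # unrelated SKU
    Job(103, frozenset({"b200"})),                      # queued for a contested runner
]
print(select_victims(jobs, runners, {"b200"}, own_run_id=100))  # [101, 103]
```

Run 101 is selected even though its job asked for b200-multinode, because the runner it occupies also carries the b200 label. Matching on requested labels alone would miss it.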

Reviewed by Cursor Bugbot for commit 6a11a53.

When an org member adds 'priority' alongside a sweep label, setup
cancels every other in-progress or queued run with jobs on the sweep's
target runners (resolved through each runner's full label set, since
SKU fleets share nodes across labels) and records them in the
preempted-runs artifact. A restore-preempted job at the end re-runs
each victim's failed jobs via gh run rerun --failed. Preemption is
run-granular: GitHub has no per-job cancel API.
cquil11 requested a review from a team on June 12, 2026 20:39
Comment on lines +830 to +855
needs: [setup, collect-results, collect-evals]
if: >-
  always() &&
  github.event_name == 'pull_request' &&
  contains(github.event.pull_request.labels.*.name, 'priority') &&
  needs.setup.result == 'success'
runs-on: ubuntu-latest
steps:
  - name: Download preempted runs artifact
    uses: actions/download-artifact@3e5f45b2cfb9172054b4087a40e8e0b5a5461e7c # v8.0.1
    with:
      name: preempted-runs

  - name: Re-run failed jobs of preempted runs
    env:
      GH_TOKEN: ${{ secrets.REPO_PAT || github.token }}
    run: |
      count=$(jq 'length' preempted_runs.json)
      echo "Restoring ${count} preempted run(s)"
      jq -r '.[].run_id' preempted_runs.json | while read -r id; do
        if gh run rerun "$id" --failed --repo "${{ github.repository }}"; then
          echo "Re-ran failed jobs of run ${id}"
        else
          echo "::warning::Could not re-run failed jobs of run ${id}; restore manually with: gh run rerun ${id} --failed --repo ${{ github.repository }}"
        fi
      done

cursor (bot) left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

always() &&
github.event_name == 'pull_request' &&
contains(github.event.pull_request.labels.*.name, 'priority') &&
needs.setup.result == 'success'

Preempt without guaranteed restore

Medium Severity

Preemption and artifact upload live in setup, but restore-preempted runs only when needs.setup.result == 'success'. If the "Upload preempted runs artifact" step fails after runs have been cancelled, the setup job fails and restore never runs, even though the victims were already preempted.
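
The gap can be seen by modelling the job's `if:` expression as a plain predicate. The sketch below only mirrors the condition quoted above; the function name is hypothetical, and `always()` is left out because it only lifts the implicit success check on the `needs` jobs while the remaining terms still gate the job.

```python
def restore_job_runs(event_name: str, pr_labels: list[str], setup_result: str) -> bool:
    """Mirror of the `if:` condition on restore-preempted."""
    return (
        event_name == "pull_request"
        and "priority" in pr_labels
        and setup_result == "success"
    )


# Normal path: setup succeeded, so the restore job runs.
print(restore_job_runs("pull_request", ["priority"], "success"))  # True

# Setup failed after cancelling victims (e.g. the upload step failed):
# the restore job is skipped although runs were already preempted.
print(restore_job_runs("pull_request", ["priority"], "failure"))  # False
```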
